pRactice corner: Tidy Models - Regression

lruolin

Overview

This is an exercise for me to revise the tidymodels way of carrying out regression using linear regression (OLS) and random forest models. Only after going through the readings, my course materials, and actually trying to work through a dataset myself do I really appreciate what my Prof was trying to teach during his machine learning course..

I will be working on a dataset familiar to me - the white wine dataset.

Machine learning can be divided into supervised and unsupervised machine learning - the differentiating point is whether you know the outcome. In this case, I want to apply what I have learnt on predicting a known numerical outcome - this would be regression. If I want to predict a known categorical outcome - for eg whether the wine is red or white, then that would be classification. If I am unsure what are the types of wine, and just want to group them, then that would be clustering.

For regression, I could work on predicting the quality of wine from the white wine dataset, the quality of wine from the red wine dataset, or the quality of wine from both the red and white wine dataset.

As a start, let me try to predict the quality score of white wine from the various attributes.

General Workflow for Predicting Numerical Outcome using Linear Regression and Random Forest

Most of the points mentioned below were taken from: https://jhudatascience.org/tidyversecourse/model.html#summary-of-tidymodels.

This was a great read for me to frame my learning and see the whole picture

The general steps are outlined below:

1. Import the data

I will import the white wine dataset from https://archive.ics.uci.edu/ml/datasets/wine+quality. The attributes are:

Features/X variables

fixed acidity
volatile acidity
citric acid
residual sugar
chlorides
free sulfur dioxide
total sulfur dioxide
density
pH
sulphates
alcohol

Outcome/Y variable: 12 - quality (score between 0 and 10)

2. Define the question you want to ask the data

Can I predict the quality score (Y variable), based on the physicochemical attributes of white wine (X variables)?

3. Clean and transform the data

In this case, the data is already in a tidy format. As for transformations required, I will leave it to the pre-processing step later.

4. Exploratory data analysis (EDA)

What do you look out for when doing EDA?

Shape of data: How many Y and how many X variables are there? How many observations are there? The number of observations (n) and number of parameters (p) may affect your choice of models.
Type of variables: are they numerical or categorical? Or are they numerical, but actually can be transformed into categorical (eg month of the year), or are they categorical, but should be transformed into dummy variables (eg for regression)?
Are there any missing values? This may affect the modelling as some models cannot handle missing values.
Within each variable, what is the min, max, mean, median, range? Are the datapoints normally distributed, or skewed? This affects whether modelling, for eg OLS regression can be carried out, or should further transformations be carried out first?
How are the X variables related to each other? Are they correlated? What is the strength of correlation? Is there a multi-collinearity issue? These are points to be addressed for linear regression.

5. Preprocessing the data

After identifying all the “touch-up” points required for successful modelling, the data may be pre-processed using the recipes package. This is really a powerful package that can transform the data the way you want, using single-line commands (as compared to multiple clicks of the mouse). However, it requires the user to know what steps are to be taken.

A list of preprocessig steps is given below:

https://recipes.tidymodels.org/reference/index.html#section-basic-functions

Like cooking, this is part art part science. If I want to do missing values imputation, which kind of imputation do I use? I am still learning as I go along for this step..

In a way, this preprocessing step helps you to zoom in razor sharp to the important X variables that can be used to predict Y. These X variables may exist as hidden variables that need carving out and polishing/transformation. There may be X variables that are of not much importance, so it is important to extract relevant information and keep the number of variables as small as possible without compromising on accuracy. In other words, that is called feature engineering.

6. Carry out modelling on the training dataset

Split dataset into training, testing and cross-validation. The training dataset is for you to train models with. The cross-validation dataset should be a subset of training dataset, for tuning different parameters of the model to make it an even better model. Cross-validation is useful when the number of observations isn’t big enough to split into three different datasets for training, validation and testing. Instead, the data is randomly partitioned into subsamples to be used as training dataset for one partition, and the same subsample would be used as test dataset for another partition. In repeated cross-validation, the cross-validation is repeated many times, giving random partitions of the original sample. The test dataset is for testing the trained model to see if the model is able to deliver predictive results.
Usually, you will train more than one model, and then compare the results. The choice of model could span over simple, easy to understand linear models, or difficult to interpret but accurate models (eg neural networks which is like a blackbox). The results of the trained dataset will usually not fare too badly, since the model was built using the training dataset. Different models may require different preprocessing and different types of parameter tuning.

7. Assessing the test dataset.

A litmus test of whether the model works is to look at how well the model performs its predictive task when a set of completely new data is provided to the model. Models may be assessed in terms of r-sq, root mean square error or mean absolute error to judge the performance.

The indicators above give us an understanding of how accurate the model is in terms of predicting new data. A model that is over-fitted fits the training dataset well, but is unable to predict the test dataset well. A model that can fit the test dataset well, may not be accurate enough to give good predictions. This is known as the bias-variance tradeoff.

8. Communicate the modelling results

Results should be shown as visualizations/data tables to communicate the findings.

Load required packages

Load required packages:

library(pacman)
p_load(tidyverse, janitor, GGally, skimr, ggthemes, ggsci,
       broom, gvlma, ggfortify, jtools, car, huxtable, sandwich,
       tidymodels, parsnip, vip)

Import

white_rawdata <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", 
                            strip.white = T, sep = ";", header = T) %>% 
  as_tibble() 

# save as another variable 
data_white <- white_rawdata

Exploratory data analysis

glimpse(data_white) # 12 variables, all numerical

Rows: 4,898
Columns: 12
$ fixed.acidity        <dbl> 7.0, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7.0,…
$ volatile.acidity     <dbl> 0.27, 0.30, 0.28, 0.23, 0.23, 0.28, 0.3…
$ citric.acid          <dbl> 0.36, 0.34, 0.40, 0.32, 0.32, 0.40, 0.1…
$ residual.sugar       <dbl> 20.70, 1.60, 6.90, 8.50, 8.50, 6.90, 7.…
$ chlorides            <dbl> 0.045, 0.049, 0.050, 0.058, 0.058, 0.05…
$ free.sulfur.dioxide  <dbl> 45, 14, 30, 47, 47, 30, 30, 45, 14, 28,…
$ total.sulfur.dioxide <dbl> 170, 132, 97, 186, 186, 97, 136, 170, 1…
$ density              <dbl> 1.0010, 0.9940, 0.9951, 0.9956, 0.9956,…
$ pH                   <dbl> 3.00, 3.30, 3.26, 3.19, 3.19, 3.26, 3.1…
$ sulphates            <dbl> 0.45, 0.49, 0.44, 0.40, 0.40, 0.44, 0.4…
$ alcohol              <dbl> 8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.…
$ quality              <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, …

summary(data_white) # scale and range of values are quite different

 fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
 Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
 1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
 Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
 Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
 3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
 Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
   chlorides       free.sulfur.dioxide total.sulfur.dioxide
 Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
 1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
 Median :0.04300   Median : 34.00      Median :134.0       
 Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
 3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
 Max.   :0.34600   Max.   :289.00      Max.   :440.0       
    density             pH          sulphates         alcohol     
 Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
 1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
 Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
 Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
 3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
 Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
    quality     
 Min.   :3.000  
 1st Qu.:5.000  
 Median :6.000  
 Mean   :5.878  
 3rd Qu.:6.000  
 Max.   :9.000

skim(data_white) # no missing values, probably need to normalize data

Table 1: Data summary
Name	data_white
Number of rows	4898
Number of columns	12
_______________________
Column type frequency:
numeric	12
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
fixed.acidity	1	6.85	0.84	3.80	6.30	6.80	7.30	14.20	▁▇▁▁▁
volatile.acidity	1	0.28	0.10	0.08	0.21	0.26	0.32	1.10	▇▅▁▁▁
citric.acid	1	0.33	0.12	0.00	0.27	0.32	0.39	1.66	▇▆▁▁▁
residual.sugar	1	6.39	5.07	0.60	1.70	5.20	9.90	65.80	▇▁▁▁▁
chlorides	1	0.05	0.02	0.01	0.04	0.04	0.05	0.35	▇▁▁▁▁
free.sulfur.dioxide	1	35.31	17.01	2.00	23.00	34.00	46.00	289.00	▇▁▁▁▁
total.sulfur.dioxide	1	138.36	42.50	9.00	108.00	134.00	167.00	440.00	▂▇▂▁▁
density	1	0.99	0.00	0.99	0.99	0.99	1.00	1.04	▇▂▁▁▁
pH	1	3.19	0.15	2.72	3.09	3.18	3.28	3.82	▁▇▇▂▁
sulphates	1	0.49	0.11	0.22	0.41	0.47	0.55	1.08	▃▇▂▁▁
alcohol	1	10.51	1.23	8.00	9.50	10.40	11.40	14.20	▃▇▆▃▁
quality	1	5.88	0.89	3.00	5.00	6.00	6.00	9.00	▁▅▇▃▁

data_white %>% 
  ggpairs() # distribution of X is quite skewed, except maybe for pH.

            # outcome is not unimodal --> ? 

data_white %>% 
  ggcorr(label = T, label_alpha = T, label_round = 2) # some collinearity issues?

Seems like OLS linear model isn’t really the best option for this case, since Y is multi-modal. It may be more suitable for classification to predict Y instead.

Nevertheless, let me compare the performance of OLS linear model with random forest, just for me to familiarise myself with the workflow for regression.

Tidymodels

Splitting the dataset into training and testing datasets

set.seed(202102212)
whitewine_split <- initial_split(data_white, prop = 0.8)

whitewine_train <- training(whitewine_split)
whitewine_test <- testing(whitewine_split)

# split training dataset for cross-validation
set.seed(20210228)
white_cv <- vfold_cv(whitewine_train) # split training dataset for tuning mtry later

Splitting the dataset into a training and testing dataset helps to minimise over-fitting of the model. Over-fitting the model would mean that the model fits the existing data very well, but is unable to predict for new data accurately.

Preprocessing

The aim of preprocessing would be to solve the multi-collinearity issue, transform the data so that the distribution is not skewed, as well as to normalize the data.

whitewine_reciped <- whitewine_train %>% 
  recipe(quality ~., .) %>% 
  step_log(all_predictors(), -pH, offset = 1) %>% # do not use all numeric since will affect Y (outcome)
  step_corr(all_predictors(), threshold = 0.5) %>%  # remove variables with r-sq > 0.5
  step_normalize(all_predictors()) # means centering and scaling

whitewine_reciped

Data Recipe

Inputs:

      role #variables
   outcome          1
 predictor         11

Operations:

Log transformation on all_predictors(), -pH
Correlation filter on all_predictors()
Centering and scaling for all_predictors()

Train the data recipe

whitewine_preprocessed <- prep(whitewine_reciped, verbose = T)

oper 1 step log [training] 
oper 2 step corr [training] 
oper 3 step normalize [training] 
The retained training set is ~ 0.29 Mb  in memory.

ww_transformed <- whitewine_preprocessed %>% bake(new_data = NULL) # see preprocessed data
ww_transformed %>%  as_tibble() %>% round(., digits = 3) %>%  head()

┌────────────────────────────────────────────────────────────────── │ fixed.ac volatile citric.a residual chloride free.sul
│ idity .acidity cid .sugar s fur.diox
│ ide
├────────────────────────────────────────────────────────────────── │ 0.226 -0.042 0.264 1.83 -0.023 0.682
│ -0.643 0.27 0.093 -1.1 0.171 -1.47
│ 0.46 -0.471 -0.08 0.691 0.604 0.763
│ 0.46 -0.471 -0.08 0.691 0.604 0.763
│ 1.45 0.063 0.598 0.436 0.219 -0.075
│ 0.226 -0.042 0.264 1.83 -0.023 0.682
└──────────────────────────────────────────────────────────────────

Column names: fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, pH, sulphates, alcohol, quality

6/10 columns shown.

# check for missing values 
ww_transformed %>% 
  map(is.na) %>%  
  map_df(sum) %>% 
  tidy() %>% 
  select(column, mean) %>%  # no missing values
  as_tibble()

               ┌──────────────────────────────┐
               │ column                  mean │
               ├──────────────────────────────┤
               │ fixed.acidity              0 │
               │ volatile.acidity           0 │
               │ citric.acid                0 │
               │ residual.sugar             0 │
               │ chlorides                  0 │
               │ free.sulfur.dioxide        0 │
               │ pH                         0 │
               │ sulphates                  0 │
               │ alcohol                    0 │
               │ quality                    0 │
               └──────────────────────────────┘

Column names: column, mean

Specify the models

Linear regression (OLS)

ww_lr_model <- linear_reg() %>% 
  set_engine("lm") %>% # there are other options available, eg glmnet
  set_mode("regression") # could also be classification for certain models, so just specify as best practice to be clear

ww_lr_workflow <- workflow() %>% 
  add_recipe(whitewine_reciped) %>% 
  add_model(ww_lr_model)

#  fit model to training data, and get predicted values
final_ww_lm_model_fit <- fit(ww_lr_workflow, whitewine_train)

# understanding the lm model

lm_fit_output <- final_ww_lm_model_fit %>% 
  pull_workflow_fit() %>% 
  tidy() %>% 
  as_tibble()

lm_fit_output

┌────────────────────────────────────────────────────────────┐
│ term          estimate   std.error   statistic     p.value │
├────────────────────────────────────────────────────────────┤
│ (Intercept    5.88          0.012     488        0         │
│ )                                                          │
│ fixed.acid   -0.0459        0.0139     -3.3      0.000983  │
│ ity                                                        │
│ volatile.a   -0.199         0.0126    -15.8      2.61e-54  │
│ cidity                                                     │
│ citric.aci   -0.000888      0.0129     -0.0686   0.945     │
│ d                                                          │
│ residual.s    0.121         0.0142      8.53     1.98e-17  │
│ ugar                                                       │
│ chlorides    -0.0194        0.0133     -1.45     0.147     │
│ free.sulfu    0.125         0.0131      9.61     1.31e-21  │
│ r.dioxide                                                  │
│ pH            0.0165        0.0139      1.19     0.234     │
│ sulphates     0.0465        0.0123      3.78     0.000161  │
│ alcohol       0.458         0.0147     31.1      4.63e-190 │
└────────────────────────────────────────────────────────────┘

Column names: term, estimate, std.error, statistic, p.value

lm_fit <- final_ww_lm_model_fit %>% 
  pull_workflow_fit()

lm_fit

parsnip model object

Fit time:  8ms 

Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
        (Intercept)        fixed.acidity     volatile.acidity  
          5.8775198           -0.0459302           -0.1989870  
        citric.acid       residual.sugar            chlorides  
         -0.0008882            0.1212696           -0.0193643  
free.sulfur.dioxide                   pH            sulphates  
          0.1254698            0.0165026            0.0465193  
            alcohol  
          0.4576764

# Looking at the fitted values:
lm_fitted_values <- lm_fit$fit$fitted.values

# another way, from workflow
lm_wf_fitted_values <- 
  broom::augment(lm_fit$fit, data = whitewine_train) %>% 
  select(quality, .fitted: .std.resid)

glimpse(lm_wf_fitted_values)

Rows: 3,919
Columns: 6
$ quality    <int> 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 7, 5, 6, 8, 5, 8…
$ .fitted    <dbl> 5.469029, 5.165576, 5.863142, 5.863142, 5.686045,…
$ .hat       <dbl> 0.0016112016, 0.0019222325, 0.0009142831, 0.00091…
$ .sigma     <dbl> 0.7532844, 0.7532139, 0.7533292, 0.7533292, 0.753…
$ .cooksd    <dbl> 8.032116e-05, 2.368036e-04, 3.023818e-06, 3.02381…
$ .std.resid <dbl> 0.705488442, 1.108851543, 0.181776968, 0.18177696…

# looking at variable importance
vip_lm <- final_ww_lm_model_fit %>% 
  pull_workflow_fit() %>% # extracts the model information
  vip(num_features = 10, 
      aesthetics = list(fill = "deepskyblue4")) + # most important factor is alcohol +
  labs(title = "Variable Importance: Linear Regression") +
  theme_few() +
  theme(axis.text = element_text(face = "bold", size = 14))

vip_lm

Random forest model

rf_model <- rand_forest() %>% 
  set_args(mtry = tune()) %>% 
  set_mode(mode = "regression") %>% 
  set_engine(engine = "ranger", importance = "impurity")

rf_model

Random Forest Model Specification (regression)

Main Arguments:
  mtry = tune()

Engine-Specific Arguments:
  importance = impurity

Computational engine: ranger

rf_workflow <- workflow() %>% 
  add_recipe(whitewine_reciped) %>% 
  add_model(rf_model)

rf_grid <- expand.grid(mtry = 3:7) # choose sqrt(no. of variables) usually

tuned_rf_results <- rf_workflow %>% 
  tune_grid(resamples = white_cv, # using cv dataset from training dataset
            grid = rf_grid,
            metrics = metric_set(rmse, rsq, mae))

model_results <- tuned_rf_results %>% 
  collect_metrics()

finalized_rf_param <- tuned_rf_results %>% 
  select_best(metric = "rmse") %>% 
  as_tibble()

finalized_rf_param #M TRY = 3

              ┌───────────────────────────────┐
              │   mtry   .config              │
              ├───────────────────────────────┤
              │      3   Preprocessor1_Model1 │
              └───────────────────────────────┘

Column names: mtry, .config

rf_model_b <- rand_forest() %>% 
  set_args(mtry = 3) %>% 
  set_engine(engine = "ranger", importance = "impurity") %>% 
  set_mode(mode = "regression")

rf_workflow_b <- workflow() %>% 
  add_recipe(whitewine_reciped) %>% 
  add_model(rf_model_b)

final_ww_rf_model_fit <- fit(rf_workflow_b, whitewine_train)

# understanding the rf model

# for random forest, need to set importance = impurity in set_engine() to extract this
vip_rf <- final_ww_rf_model_fit %>% 
  pull_workflow_fit() %>% # extracts the model information
  vip(num_features = 10, 
      aesthetics = list(fill = "darkorange"))+ # most important factor is alcohol
  labs(title = "Variable Importance: Random Forest") +
  theme_few() +
  theme(axis.text = element_text(face = "bold", size = 14))

vip_rf

Comparing Linear Regression (OLS) vs Random Forest (RF)

gridExtra::grid.arrange(vip_lm, vip_rf, nrow = 2)

Alcohol content was the most important variable for both OLS and random forest models.

Assessing on test data

results_train <- final_ww_lm_model_fit %>% 
  predict(new_data = whitewine_train) %>%  # use actual train data, not preprocessed data
  mutate(truth = whitewine_train$quality,
         model = "lm") %>% 
  bind_rows(final_ww_rf_model_fit %>% 
              predict(new_data = whitewine_train) %>% 
              mutate(truth = whitewine_train$quality,
                     model = "rf")) 
  
results_train %>% 
  group_by(model) %>% 
  metrics(truth = truth, estimate = .pred) %>% 
  as_tibble()

         ┌──────────────────────────────────────────┐
         │ model   .metric   .estimator   .estimate │
         ├──────────────────────────────────────────┤
         │ lm      rmse      standard         0.752 │
         │ rf      rmse      standard         0.277 │
         │ lm      rsq       standard         0.282 │
         │ rf      rsq       standard         0.934 │
         │ lm      mae       standard         0.584 │
         │ rf      mae       standard         0.198 │
         └──────────────────────────────────────────┘

Column names: model, .metric, .estimator, .estimate

results_test <- final_ww_lm_model_fit %>% 
  predict(new_data = whitewine_test) %>% 
  mutate(truth = whitewine_test$quality,
         model = "lm") %>% 
  bind_rows(final_ww_rf_model_fit %>% 
              predict(new_data = whitewine_test) %>% 
              mutate(truth = whitewine_test$quality,
                     model = "rf")) 

results_test %>% 
  group_by(model) %>% 
  metrics(truth = truth, estimate = .pred) %>% 
  as_tibble()

         ┌──────────────────────────────────────────┐
         │ model   .metric   .estimator   .estimate │
         ├──────────────────────────────────────────┤
         │ lm      rmse      standard         0.737 │
         │ rf      rmse      standard         0.589 │
         │ lm      rsq       standard         0.295 │
         │ rf      rsq       standard         0.56  │
         │ lm      mae       standard         0.581 │
         │ rf      mae       standard         0.435 │
         └──────────────────────────────────────────┘

Column names: model, .metric, .estimator, .estimate

When comparing rmse, rf has lower rmse in training dataset but the rmse value increased in the test dataset –> overfitting and cannot predict as well.This was the same for other indicators rsq and mean absolute error.

Bear in mind that in the first place, the outcome variable Y was multi-modal. This may be the reason why OLS wasn’t a suitable learner.

Visualizing the assessment

results_train %>% 
  ggplot(aes(x =  truth, y = .pred)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  geom_abline(col = "darkorange") +
  labs(x = "actual values (truth)",
       y = "predicted values",
       title = "Training dataset") +
  scale_x_continuous(breaks = c(1:10)) +
  facet_wrap( model ~ ., ncol = 2) +
  theme_few()

results_test %>% 
  ggplot(aes(x =  truth, y = .pred)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  geom_abline(col = "darkorange") +
  labs(x = "actual values (truth)",
       y = "predicted values",
       title = "Testing dataset") +
  scale_x_continuous(breaks = c(1:10)) +
  facet_wrap( model ~ ., ncol = 2) +
  theme_few()

Learning points

It took me a while to piece different pieces of the jigsaw together to see the whole picture for machine learning. Initially, I will be carrying out EDA blindly, simply using skim() because it is a convenient function, but not fully understanding what I should be looking out for. I would be doing pre-processing steps at random, depending on what I saw from other websites. Finally, I saw the light that the purpose of doing EDA was to understand what I should be doing for preprocessing!

It is always good to start with the simple OLS when I am learning regression. There are assumptions that must be met before doing OLS – these could be checked using the gvlma package, and you can carry out the necessary transformations before doing OLS. There are other types of linear regression, for example generalized linear model (GLM), which I should try as well.

The order of carrying out preprocessing steps matter!

The choice of all_numerical, all_predictors in the recipe step matters! In this case, all_numerical includes the Y variable. Although Y is multimodal, it is not skewed, so I should not log transform it (which is what would happen if I were to use step_log(all_numerical())). If I log-transformed Y, I would run into errors further along the script, as there are some bugs regarding predict function if Y is transformed. The OLS model performed relatively consistently in both training and test dataset. However, the RF model performed better in the training dataset, but performance was poorer in the test dataset. This suggested that the RF model, in this case, had over-fitting issues.

Next steps:

Try out regression on a dataset in which Y is suitable for regression analysis
Try out classification on wine dataset.
Try out step_dummy, which creates numerical variables out of categorical variables
Try out different algorithms and their tuning parameters

“There is only one corner of the universe you can be certain of improving, and that’s your own self.” - ― Aldous Huxley

This is just the beginning of my learning journey!

References

Comment on this article Share:

Tidy Models - Regression